Support Parquet Byte Stream Split Encoding #5293

mwlon · 2024-01-10T02:39:07Z

Which issue does this PR close?

Closes #4102.

This builds on the previous attempt at a PR: #4183

Rationale for this change

This brings us up to speed with the full set of Parquet encodings, I believe. It will also be important for the likely addition of f16 and fixed len byte arrays to the byte stream split encoding.

What changes are included in this PR?

- implemented byte stream split encoding
- benchmark suite for encodings
  - Measured the performance as 3x faster than the previous PR's implementation. There are definitely more potential performance wins to be had with SIMD though.
- tests for the new encoding, including a test file in parquet-testing created via pyarrow.
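
For readers unfamiliar with the encoding, here is a minimal sketch of what byte stream split does to 4-byte floats: byte `j` of every value is gathered into stream `j`, and the streams are concatenated. This is not the code from this PR; the function name `byte_stream_split_encode_f32` is hypothetical and the loop is deliberately naive (no SIMD, no attempt at the performance discussed above).

```rust
/// Byte-stream-split encode a slice of f32 values (illustrative sketch only).
///
/// With K = 4 bytes per value and n values, the output holds n * K bytes:
/// first byte 0 of every value, then byte 1 of every value, and so on.
fn byte_stream_split_encode_f32(values: &[f32]) -> Vec<u8> {
    const K: usize = 4; // bytes per f32
    let n = values.len();
    let mut out = vec![0u8; n * K];
    for (i, v) in values.iter().enumerate() {
        // Little-endian byte order, matching Parquet's PLAIN layout for floats.
        let bytes = v.to_le_bytes();
        for (j, b) in bytes.iter().enumerate() {
            // Byte j of value i goes to position i within stream j.
            out[j * n + i] = *b;
        }
    }
    out
}
```

For example, encoding the two values `f32::from_le_bytes([0xAA, 0xBB, 0xCC, 0xDD])` and `f32::from_le_bytes([0x00, 0x11, 0x22, 0x33])` yields the bytes `AA 00 BB 11 CC 22 DD 33`. The encoding itself does not shrink the data; grouping similar bytes together tends to help the compression codec applied afterwards.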

Are there any user-facing changes?

No API additions

mwlon · 2024-01-10T02:40:31Z

@tustvold

ggreco

I'm not part of the apache foundation but I will try to use your code ASAP in my application. I reviewed it and it seems good.

ggreco · 2024-01-10T10:57:25Z

parquet/src/encodings/encoding/byte_stream_split_encoder.rs

@@ -0,0 +1,76 @@
+use crate::basic::{Encoding, Type};


You need to add the Apache license header to get it approved.

ggreco · 2024-01-10T10:57:50Z

parquet/src/encodings/decoding/byte_stream_split_decoder.rs

@@ -0,0 +1,104 @@
+use std::marker::PhantomData;


You need to add the Apache license header to get it approved.

pitrou · 2024-01-10T15:05:16Z

> Measured the performance as 3x faster than the previous PR's implementation.

FTR, what are the actual numbers?

pitrou · 2024-01-10T15:13:33Z

parquet/src/arrow/arrow_reader/mod.rs

+                .downcast_ref::<Float64Array>()
+                .unwrap();
+
+            // This file contains floats from a standard normal distribution


You can probably do an exact comparison on a few values as well, given that the file isn't going to change :-)

pitrou · 2024-01-10T15:15:32Z

parquet/src/encodings/decoding.rs

+            ],
+            vec![f32::from_le_bytes([0xA3, 0xB4, 0xC5, 0xD6])],
+        ];
+        test_byte_stream_split_decode::<FloatType>(data);


Also add a test for DoubleType?
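
To illustrate what a double-width case would exercise, here is a minimal standalone sketch of decoding byte-stream-split data into `f64` values, where there are 8 streams instead of 4. It does not use the `test_byte_stream_split_decode` helper or the decoder from this PR; the function name `byte_stream_split_decode_f64` is hypothetical.

```rust
/// Decode byte-stream-split data back into f64 values (illustrative sketch only).
///
/// Returns None if the input length is not a multiple of 8, since the
/// input must consist of 8 equally sized byte streams.
fn byte_stream_split_decode_f64(encoded: &[u8]) -> Option<Vec<f64>> {
    const K: usize = 8; // bytes per f64
    if encoded.len() % K != 0 {
        return None;
    }
    let n = encoded.len() / K; // number of values, and length of each stream
    let mut values = Vec::with_capacity(n);
    for i in 0..n {
        let mut bytes = [0u8; K];
        for (j, b) in bytes.iter_mut().enumerate() {
            // Byte j of value i sits at position i within stream j.
            *b = encoded[j * n + i];
        }
        values.push(f64::from_le_bytes(bytes));
    }
    Some(values)
}
```

As a check, the values `1.0` and `-2.5` have little-endian bytes `00 00 00 00 00 00 F0 3F` and `00 00 00 00 00 00 04 C0`, so their encoded form is twelve zero bytes followed by `F0 04 3F C0`, and decoding that input returns `[1.0, -2.5]`.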

mwlon · 2024-01-10T23:10:09Z

> FTR, what are the actual numbers?

On my machine, it was about 17us for floats and 36us for doubles, pretty much the same speed for encoding and decoding. These are all just short of 4GB/s. The implementation I built from was around 1.2GB/s.
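
The conversion from time per batch to throughput is simple arithmetic, sketched below. Note that the thread does not state how many values each benchmark iteration processed; the 16,384-value batch used in the usage note is purely an assumption, chosen because it makes the quoted timings and the "just short of 4GB/s" figure consistent with each other. The helper name `throughput_gb_per_s` is hypothetical.

```rust
/// Throughput in GB/s (10^9 bytes per second) for `num_bytes` bytes
/// processed in `micros` microseconds.
fn throughput_gb_per_s(num_bytes: usize, micros: f64) -> f64 {
    let seconds = micros * 1e-6;
    (num_bytes as f64) / seconds / 1e9
}
```

Under that assumed batch size, floats would be 16,384 × 4 = 65,536 bytes in 17us, or about 3.86 GB/s, and doubles would be 16,384 × 8 = 131,072 bytes in 36us, or about 3.64 GB/s.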

tustvold

The code makes sense to me, and looks well tested, thank you

tustvold · 2024-01-11T12:06:10Z

parquet/src/arrow/arrow_reader/mod.rs

+        for batch in record_reader {
+            let batch = batch.unwrap();
+            row_count += batch.num_rows();
+            let f32_col = batch


FWIW you can use https://docs.rs/arrow-array/latest/arrow_array/cast/trait.AsArray.html to make this less verbose

tustvold · 2024-01-11T12:08:07Z

Perhaps we could update the README to no longer say that we don't support this encoding

mwlon · 2024-01-12T02:28:05Z

Updated the readme. Is this ready to merge now?

tustvold · 2024-01-12T10:23:40Z

Thanks again

ggreco · 2024-01-16T09:49:19Z

@tustvold I see the v50 release is from 4 days ago, and I assumed it would contain this change, but when I updated to v50 and enabled BYTE_STREAM_SPLIT I got this error:

thread 'upload::tests::test_upload_sensor' panicked at 'called `Result::unwrap()` on an `Err` value: NYI("Encoding BYTE_STREAM_SPLIT is not supported")', /Users/gabry/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parquet-50.0.0/src/column/writer/mod.rs:249:58

tustvold · 2024-01-16T10:10:39Z

The release was cut prior to this being merged, as it goes through an ASF mandated RC process that takes at least 3 days. You can view the changelog and/or the git history to see what made a particular release.

alamb · 2024-03-02T11:42:51Z

Tracking next release in #5453

simonvandel and others added 12 commits January 7, 2024 18:51

wip byte-stream-split

aa72add

decoding works

a4f98a3

impl split

f260b0b

clean up

a83862f

whitespace

436f579

remove println

844e7c3

get compiling after rebase

7f5083b

integration test, as one might call it

3eea8a7

update parquet-testing revision

f8a010b

encoding bench

5e2c30c

improve performance

48608d0

test fix

fd92ee4

github-actions bot added the parquet label (Changes to the parquet crate) Jan 10, 2024

mwlon mentioned this pull request Jan 10, 2024

Parquet: Implement support for Encoding::BYTE_STREAM_SPLIT #4183

Closed

ggreco approved these changes Jan 10, 2024


add apache headers

73f23d0

pitrou reviewed Jan 10, 2024


tustvold approved these changes Jan 11, 2024


tustvold changed the title from "Byte stream split" to "Support Parquet Byte Stream Split Encoding" Jan 11, 2024

one more test and readme update

08bc944

tustvold merged commit 4c3e9be into apache:master Jan 12, 2024
17 checks passed

tustvold mentioned this pull request Mar 1, 2024

Umbrella issue for parquet 2.6.0 support #223

Closed

alamb mentioned this pull request Mar 2, 2024

Release arrow-rs / parquet version (51.0.0 or 50.1.0) #5453

Closed

anjakefala mentioned this pull request Jul 12, 2024

Extend support for BYTE_STREAM_SPLIT to FIXED_LEN_BYTE_ARRAY, INT32, and INT64 primitive types #6048

Closed
